ABSTRACT

Dimension reduction is a key algorithmic tool with many applications
including nearest-neighbor search, compressed sensing and
linear algebra in the streaming model. In this work we obtain
a sparse version of the fundamental tool in dimension reduction,
the Johnson-Lindenstrauss transform. Using hashing and local
densification, we construct a sparse projection matrix with just
$\tilde{O}(\frac{1}{\epsilon})$ non-zero entries per column. We also show a matching lower
bound on the sparsity for a large class of projection matrices. Our
bounds are somewhat surprising, given the known lower bounds of
$\Omega(\frac{1}{\epsilon^2})$ both on the number of rows of any projection matrix and on
the sparsity of projection matrices generated by natural constructions.

Using this, we achieve an $\tilde{O}(\frac{1}{\epsilon})$ update time per non-zero element
for a $(1 \pm \epsilon)$-approximate projection, thereby substantially
outperforming the $\tilde{O}(\frac{1}{\epsilon^2})$ update time required by prior approaches.
A variant of our method offers the same guarantees for sparse vectors,
yet its $\tilde{O}(d)$ worst case running time matches the best approach
of Ailon and Liberty.

Categories and Subject Descriptors. F.2.0 [Theory of Computa- 
tion] : Analysis of Algorithms and Problem Complexity — General; 
G.3 [Mathematics of Computing]: Probability and Statistics — 
Probabilistic algorithms 

General Terms. Algorithms, Theory 

Keywords. Johnson-Lindenstrauss transform, Dimensionality reduction

1. INTRODUCTION 

Dimension reduction is a fundamental primitive with many algorithmic
applications including nearest-neighbor search [2, 19],
compressed sensing [11], data stream computations [5], computational
geometry [13], numerical linear algebra [14, 17, 26, 28], machine
learning [8, 33], graph sparsification [30], and more; see the
monograph [32] for further applications. The seminal random projection
method of Johnson and Lindenstrauss [20] consists of just
multiplying the input vector by a suitably sampled random projection
matrix: $n$ vectors in $d$-dimensional space can be mapped
into an $O(\frac{1}{\epsilon^2}\log n)$-dimensional subspace such that the length of
each vector is distorted by at most $(1 \pm \epsilon)$. This simple and elegant
method has the following desirable properties: (i) it is linear, (ii) it
is oblivious to the input, (iii) it works with high probability for a
given set of input points, and (iv) the target dimension is independent
of $d$.

Given its algorithmic importance, much effort has been devoted 
to speeding up the mapping. One line of work achieves this goal 
by making the projection matrix sparse, and hence its multiplica- 
tion with the input vectors faster. Sparsity is typically achieved by 
independently setting each matrix entry to zero with a certain probability
[1, 2, 23]. There is however a limit on the extent of sparsity
achievable by this approach: a result of Matousek [23, Theorem 
4.1] states that such matrices need to contain $\Omega(\frac{\alpha^2}{\epsilon^2})$ non-zeroes in
expectation per column, if they were to preserve the length of a unit
vector with infinity norm at most $\alpha$.

Our results. We obtain a sparse random projection matrix of size
$k \times d$ that has $O(\frac{1}{\epsilon}\log^2(\frac{k}{\delta})\log(\frac{1}{\delta}))$ non-zero entries per column,
where $k = O(\frac{1}{\epsilon^2}\log(\frac{1}{\delta}))$. This is the first construction with $o(\frac{1}{\epsilon^2})$
non-zero entries in the projection matrix. (For our results to be
improvements, we need to assume that $\log^2(\frac{k}{\delta}) = o(\frac{1}{\epsilon})$. Our analysis,
however, does not need this assumption.)

A highlight of our approach is to construct the projection matrix 
itself with care. Instead of using independent random variables, as 
is typically done, we construct it out of a hash function that entails 
some dependency among the entries. This construction is implicit 
in the work of Langford et al. [21] and Weinberger et al. [33], where
it played a role mostly as a practical heuristic. The hash-based 
construction introduces new technical difficulties, but ensures that 
we have exactly a fixed number of non-zero entries in each column, 
thereby relaxing the requirements on the density of input vectors. 

Specifically, whereas prior work requires that for a unit vector
$x$, $\|x\|_\infty = O(\epsilon)$, for a constant number of expected entries per
column of the projection matrix, we only need $\|x\|_\infty = O(\sqrt{\epsilon})$.
In order to achieve this level of densification, we can use a simple
replication technique on $x$ [33].

To manage the technical difficulties that arise from the depen- 
dencies, we show that the contribution from each hash bucket is 
bounded, and that the total amount of noise arising from the colli- 
sions in each hash bucket is small. The reduction in overall variance 
comes from the fact that each dimension is mapped to exactly one 
hash bucket, and the lack of self-collisions (which would be present 
if the entries in the matrix were i.i.d.) leads to a reduction in the 
variance of the cross-product error. There are several subtleties in 
analyzing this, in particular, the errors from different hash buckets 
being correlated. We handle this by an application of the FKG inequality
on the product of the moment generating function of the
random variables capturing the errors. This helps us in obtaining a
concentration on the sum of the errors. Our choice of ±1 random
variables (instead of Gaussian random variables¹) plays a critical
role in making our proofs work.

Implications for sparse vectors. The resulting running time for
an input vector $x$ having $nnz(x)$ non-zeros is $\tilde{O}(\frac{nnz(x)}{\epsilon})$, which is better
than the running time obtained by [22, 23] for sparse vectors in
terms of the sparsity ratio $\frac{nnz(x)}{d}$ as well as by the factor $\frac{1}{\epsilon}$.
Furthermore, using a block-Hadamard based preconditioner, instead of
a global Hadamard transform, we can actually ensure that our running
time for all vectors is $\tilde{O}(\min(\frac{nnz(x)}{\epsilon}, d))$, which is once again an
improvement over existing results. The qualitative difference in the
running times is starker in the turnstile model of streaming. Since
the updates in the stream come as $(i, v_i)$, updating any sketch that
requires computing a global Hadamard transform is very expensive,
taking $O(d)$ time per update. Our update time, on the other
hand, is only $\tilde{O}(\frac{1}{\epsilon})$ per entry.

Our technique speeds up nearest-neighbor computation for sparse 
vectors as well. We can use our construction to preprocess the input 
vectors before applying the algorithm as described in [2, Theorem
3.2]. The effective running time is then o( ""^(^) _|_ _^ logn + 
^ log n) instead of 0{d log d + log n). For sparse vectors, this 
could represent a significant improvement. 

Related work. Since the original Johnson-Lindenstrauss result,
several authors have shown that the projection matrix could be
constructed element-wise using Gaussian or uniform ±1 variables
[1, 2, 16, 19]. Alon showed a lower bound of $\Omega(\frac{\log n}{\epsilon^2\log(1/\epsilon)})$ on the
target dimensionality [4].

In order to circumvent the sparsity lower bound of Matousek
[23], the ingenious Fast Johnson-Lindenstrauss transform (FJLT)
of Ailon and Chazelle preconditions the input with a randomized
Hadamard transform, thereby making it dense, and then applies a
sparse projection matrix [2]. The computation of the Hadamard
transform (via a fast Hadamard transform), however, forces an $O(d)$
running time irrespective of the number of non-zeros in the input
vector. This makes it less desirable for sparse input vectors.

Ailon and Liberty [3] showed that the sparse projection matrix in
[2] could be replaced by a dense, deterministic, but well-structured
code matrix, and improved the running time to $O(d\log k)$ over a
wide range of parameters; however, as before, the running time
of these methods is unable to take advantage of the sparsity of
the input vector. Liberty, Ailon, and Singer [22] proved that there
exist projection matrices that are applicable in $O(d)$ time if the
input satisfies density conditions that are significantly stricter than
those required for hashing. Since hashing works in linear time,
our work improves upon these results. Finally we remark that although
[3, 22] contain a spectral condition derived from Talagrand's
inequality that could be applied to our hashing construction², the
resulting bound is too weak; it fails to show that hashing improves
over even the most basic Johnson-Lindenstrauss transform.

Charikar, Chen, and Farach-Colton [12] introduced the COUNT
SKETCH data structure that used hash tables combined with pairwise
independent ±1 random variables for finding the most frequent
items in a data stream. Thorup and Zhang [31] observed
that this hashing trick could be used to speed up the celebrated
AMS sketch [5] for estimating $F_2$; this was also noted by Cormode
and Garofalakis [15]. Hashing decreases the update time from
$O(\frac{1}{\epsilon^2}\log(\frac{1}{\delta}))$ to $O(\log(\frac{1}{\delta}))$. These estimators, however, are
non-linear: they return the median of estimates obtained from $O(\log(\frac{1}{\delta}))$
independent hash functions, which makes them less desirable for
some applications. Our results essentially show that by increasing
the update time to $O(\frac{1}{\epsilon}\log(\frac{1}{\delta}))$, the median could be replaced by
an average.

¹In fact, we need an average of $\frac{1}{\epsilon^2}$ Gaussians to get a $(1 \pm \epsilon)$-approximation.

²It is not hard to see that $\sigma$ of [22] equals $\max_i\{\sigma_i\}$ studied in
Lemma 6.

Lastly, we note that random projection using hashing has found
practical applications in machine learning [21, 29, 33]. In particular,
the densification by replication was suggested by Weinberger et al.
[33]. Although they claim a concentration bound for hashing-based
dimensionality reduction, unfortunately, their claim is false due to
an error in the application of Talagrand's inequality.

2. MAIN RESULTS 

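Let $k = \frac{12}{\epsilon^2}\log(\frac{1}{\delta})$ and $c = \frac{16}{\epsilon}\log(\frac{1}{\delta})\log^2(\frac{k}{\delta})$. Let $r =
\{r_j\}_{j \in [cd]}$ be a set of i.i.d. random variables such that for each
$j \in [cd]$, $\Pr[r_j = 1] = \Pr[r_j = -1] = 1/2$. Let $\delta_{\alpha\beta} = 1$
iff $\alpha = \beta$ and zero otherwise. Let $nnz(x)$ denote the number of
non-zero entries in vector $x$.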

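Let $h' : [cd] \to [k]$ be a hash function chosen uniformly at
random and let $H' \in \{0, \pm 1\}^{k \times cd}$ be defined as $H'_{ij} = \delta_{i h'(j)} r_j$.

Let the preconditioner $P \in \{0, \pm\frac{1}{\sqrt{c}}\}^{cd \times d}$ be defined as

$$P_{ij} = \begin{cases} \frac{1}{\sqrt{c}} & \text{for } (j-1)c + 1 \le i \le jc, \\ 0 & \text{otherwise.} \end{cases}$$

Let $\Phi = H'P$.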

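Theorem 1 For any given vector $x \in \mathbb{R}^d$, with probability $1 - 4\delta$,
$\Phi$ satisfies the following property:

$$(1 - \epsilon)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \epsilon)\|x\|_2^2. \quad (1)$$

The time required to compute $\Phi x$ is $O(\frac{1}{\epsilon}\log^2(\frac{k}{\delta})\log(\frac{1}{\delta})) \cdot nnz(x)$.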

This is easily implied by the following. Let $h : [d] \to [k]$ be a
hash function chosen uniformly at random. Let $H \in \{0, \pm 1\}^{k \times d}$
be defined as $H_{ij} = \delta_{i h(j)} r_j$; note that the matrix $H$ has only $d$
non-zero entries, exactly one per column.
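The definition of $H$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, indices are 0-based, and `random.Random` stands in for the uniformly random hash function and sign variables.

```python
import random

def make_hash_sketch(d, k, seed=0):
    """Draw h: [d] -> [k] and signs r_j in {-1, +1}.
    Assumption: a seeded PRNG models the fully random h and r."""
    rng = random.Random(seed)
    h = [rng.randrange(k) for _ in range(d)]
    r = [rng.choice((-1, 1)) for _ in range(d)]
    return h, r

def apply_hash_sketch(x_sparse, h, r, k):
    """Compute y = Hx for x given as {index: value}.
    Each non-zero touches exactly one bucket, so the cost is O(nnz(x))."""
    y = [0.0] * k
    for j, xj in x_sparse.items():
        y[h[j]] += r[j] * xj
    return y

def update(y, j, v, h, r):
    """Apply a turnstile update (j, v) to an existing sketch in O(1)."""
    y[h[j]] += r[j] * v
```

Because each column of $H$ has a single non-zero entry, a vector with one non-zero coordinate is mapped with its length preserved exactly; distortion only arises from collisions within a bucket.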

Theorem 2 For any given vector $x \in \mathbb{R}^d$ such that $\|x\|_\infty \le \frac{1}{\sqrt{c}}$,
for e < 1 and 5 < with probability 1 — 35, H satisfies the 
following property: 

$$(1 - \epsilon)\|x\|_2^2 \le \|Hx\|_2^2 \le (1 + \epsilon)\|x\|_2^2.$$

For dense vectors, Theorem[T]gives a run-time of O ( ^ log'' ( ^ ) ) ; 
this, for a small enough e, could be significantly worse than the 
running time obtained by Ailon and Liberty in [3] and Matousek
in |23|. However, we can modify the construction of the precondi- 
tioner so that we guarantee a running time of 0(ci log clog log c) 
for all vectors. Our new preconditioner is based on the randomized 
Hadamard construction by Ailon et al. [2, 3].

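Theorem 3 Let $d \ge 6c\log(\frac{3c}{\delta})$. There exists a preconditioner
$G \in \mathbb{R}^{d \times d}$ such that for any input vector $x \in \mathbb{R}^d$, with probability
$1 - 4\delta$,

$$(1 - \epsilon)\|x\|_2^2 \le \|(HG)x\|_2^2 \le (1 + \epsilon)\|x\|_2^2.$$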
The time required to compute {HG)x is given by 

o(„i„(n=^,o,'(±)..)i„,(l)). 



3. PROOF OF THEOREM 2
3.1 Preliminaries 

Without loss of generality, we can assume $\|x\|_2 = 1$. Let $Y_i =
\sum_j H_{ij} x_j = \sum_j \delta_{i h(j)} r_j x_j$, and let $\sigma_i^2 = E_r[Y_i^2]$, where $E_r$
is the expectation taken with respect to the random variables $r =
\{r_j\}$. Thus,

$$\sigma_i^2 = \sum_{j \in [d]} \delta_{i h(j)} x_j^2,$$

since the cross-product terms cancel out by the independence, i.e.,
$E_r[r_j r_{j'}] = 0$ for $j \ne j'$. Let $Z_i = Y_i^2 - E_r[Y_i^2] = Y_i^2 - \sigma_i^2$.

The outline of the proof is as follows. We need to prove that
$\sum_i Y_i^2$ is concentrated around $\|x\|_2^2 = 1$. Instead of showing
concentration of $\sum_i Y_i^2$, we will show that $\sum_i Z_i$ is concentrated
around zero. Indeed, since our hash function guarantees that each
coordinate $j \in [d]$ is mapped to one and exactly one hash bucket,
we have that $\sum_{i=1}^{k} \sigma_i^2 = \|x\|_2^2 = 1$. Therefore, $\sum_{i=1}^{k} Y_i^2 =
\sum_i \sigma_i^2 + \sum_i Z_i = 1 + \sum_i Z_i$. Showing that $\sum_i Z_i$ is concentrated
around zero is thus enough.

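We will utilize the following form of the FKG inequality [6, Theorem
6.2.1].

Theorem 4 (FKG inequality) Let $L$ be a finite distributive lattice
and let $\mu : L \to \mathbb{R}^+$ be a log-supermodular function. Then, for an
increasing function $f$ and a decreasing function $g$, we have that

$$\Big(\sum_{a \in L} f(a) g(a) \mu(a)\Big)\Big(\sum_{a \in L} \mu(a)\Big) \le \Big(\sum_{a \in L} f(a) \mu(a)\Big)\Big(\sum_{a \in L} g(a) \mu(a)\Big).$$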



3.2 Notation 

Recall that fc = -l^ log(i). Define 



a — a{k) 



o"* — — : — , and o = 



4(7 Jit 



we will assume q > 3. We define the following function as a 
shorthand to denote the upper bound on conditional expectation of 
the MGF with respect to the {rj} variables. 

$$G(u, t) = 1 + \frac{1}{\theta^2}(\exp(u\theta) - 1 - u\theta)\,t + \frac{4\delta}{k}.$$

Definition 5 (Goodness) A set A Q [d] is good if 
^jeA^^ — '^*- hash bucket is good if h^^{i) is good, 

i.e., if of = h{j)=i — hash function h is good if 

h^^{i) is good for all i. 

For a given h, let Si denote the event that the ith hash bucket is 
good. Let S be the event that the hash function h is good. By abus- 
ing notation we use S and Si to represent the indicator variables of 
the corresponding events. 

3.3 Proof details 

Recall that $Z_i = Y_i^2 - E_r[Y_i^2]$, where $Y_i = \sum_j \delta_{i h(j)} r_j x_j$, i.e.,

$$Z_i = \sum_{j \ne j',\; j, j' \in [d]} \delta_{i h(j)} \delta_{i h(j')} r_j r_{j'} x_j x_{j'}.$$

Observe that $E[Z_i] = 0$ and our goal is to show that $\sum_i Z_i$ is
concentrated around 0.
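The decomposition $Y_i^2 = \sigma_i^2 + Z_i$ is a purely algebraic identity that holds for every fixed $h$ and $r$, and it can be checked numerically. The following is a minimal sketch assuming 0-based indices and dense Python lists; the function name `bucket_stats` is hypothetical.

```python
def bucket_stats(x, h, r, k):
    """Return (Y, sigma2, Z) per bucket, where
    Y_i = sum_j [h(j)=i] r_j x_j and sigma2_i = sum_j [h(j)=i] x_j^2."""
    d = len(x)
    Y = [0.0] * k
    sigma2 = [0.0] * k
    for j in range(d):
        Y[h[j]] += r[j] * x[j]
        sigma2[h[j]] += x[j] ** 2
    # Z_i sums the cross terms over ordered pairs j != j' colliding in bucket i.
    Z = [0.0] * k
    for j in range(d):
        for jp in range(d):
            if j != jp and h[j] == h[jp]:
                Z[h[j]] += r[j] * r[jp] * x[j] * x[jp]
    return Y, sigma2, Z
```

The quadratic loop is only for checking the definition of $Z_i$ directly; since every coordinate lands in exactly one bucket, the entries of `sigma2` always sum to $\|x\|_2^2$.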



Here is an overview of the proof. We first show in Lemma 6 that
most $h$ are good. In Lemma 7 we bound the moment generating
function (MGF) of the random variable $Z_i$, for a fixed $h$. A usual
step at this point would be to remove the effect of the bad choice
of the random variables from the MGF by perhaps considering a
truncated random variable $\tilde{Z}_i = \min(Z_i, M)$. In our case, however,
such a construction would introduce a dependence among the
$\{r_j\}$ and $h$ variables, which appears to be insurmountable when
trying to apply the FKG inequality. We have to instead utilize the
notion of goodness of $h$ only in defining the truncated random variable
$\tilde{Z}_i$. Using the result of Lemma 7 we first get Corollary 8
that gives the expected and the worst-case bounds on the MGF for
a good hash function $h$. We utilize these bounds to define $\tilde{Z}_i$ in
(5). Next, in Lemma 9 we define two set functions $f_s$ and $g_s$ and
show that they are monotone, in accordance with the requirements
of the FKG inequality (Theorem 4). These functions are then used
in Lemma 10 to show that the MGF of $\sum_i \tilde{Z}_i$ can be bounded by the
product of the individual MGFs of the $\tilde{Z}_i$. We then bound the probability
of an $\epsilon$-deviation for $\sum_i Z_i$ in Theorem 11. Subsequently, we use
Theorem 11 to prove Theorems 1 and 2. Section 4 gives the proof
of Theorem 3.



Lemma 6 If $c = \frac{16}{\epsilon}\log(\frac{1}{\delta})\log^2(\frac{k}{\delta})$, then $\Pr[S] \ge 1 - \delta$.



The proof (Appendix 9.1) is an application of Bernstein's inequality
[24, Theorem 2.7] and utilizes the fact that since $\|x\|_\infty \le \frac{1}{\sqrt{c}}$,
and the hash function is random, with high probability, no $\sigma_i$
can be too large. In essence, this generalizes well-known facts
about the maximum load in the balls into bins problem for the
weighted case³.

The following lemma gives a bound on the MGF of the variable
$Z_i$ for a fixed $h$. The proof can be found in Appendix 9.2.



Lemma llfu< then for a fixed h, 

E,[exp(«Z01 < G(it,E,[Zf]). 
Lemma 7 leads to the following.



(2) 



Corollary 8 IfQ < u < then the expectation of the MGF can 
be bounded as 



Similarly, 



Eh,,[exp(MZO 1 S] < G{u,—). 



maxE,. [exp(itZi)] < G(it, cr, ). 



(3) 



(4) 



Proof. By taking expectation over h and using 



Eh[MZ^ i S]] < 2E[Zf] < ^, 



we have that 



2 45 
E,,h[exp(^.ZO I S] < 1 + ^(exp(ue) - I ^ uO) + — 



³Sanders [27] contains a proof of the expected load for the
weighted balls-and-bins problem, but does not contain a proof of
the high probability statement.



The upper bound on F,r[exp{uZi) \ S] is given by 



maxEr[exp{uZi)] 
< 1 + ^{expiuO) - 1 - u6») maxE,[Zf 13] + ^ 

O'' h k 



4S 



< 1 + —{exp{ue) - 1 - ue)at + 



where we use Er[Zf] < a^. □ 



Next, we have to handle the fact that the $Z_i$ variables are not independent.
Yet, intuitively, since $Z_i$ is roughly related to the cross-product
of the set of entries $x_j$ that map into the $i$th hash bucket,
conditioned on the fact that one of the $Z_i$ variables has achieved a
large value, the probability that another $Z_{i'}$ is also large decreases.
In fact, we show that we can apply the FKG inequality (Theorem
4) on the MGF of the $\tilde{Z}_i$ random variables. Note that this situation
is more involved than the simple negative dependence obtained on a
set of random variables by conditioning their sum to be a constant;
we cannot make such claims on $Z_i$. For all $i = 1, \ldots, k$ let
us define



if Si, 

logG{u,a^) else. 



(5) 



We first need the following lemma in preparation for the application
of the FKG inequality (Theorem 4).



Lemma 9 For 1 < s < k, u < and A C [d], let us defin 



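$$f_s(A) = E_r\left[\exp(u\tilde{Z}_s) \mid h^{-1}(s) = A\right] \quad\text{and}\quad g_s(A) = E\left[\exp\Big(u\sum_{i=1}^{s-1}\tilde{Z}_i\Big) \,\Big|\, h^{-1}(s) = A\right].$$

Then $f_s$ is an increasing and $g_s$ is a decreasing set function.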

Proof. First we prove that fs is increasing by showing that for 
all A C [d] and for all a € [d]\ A, it holds that {A U {a}) > 
MA). 

Observe that if h ^(s) is good (i.e., if Ss holds), then by Corol- 
lary[8] we have Er[exp{uZs)] < G(u, at). Thus for all h and s, it 
holds from l|5j that 



(6) 



There are two cases to consider. Suppose A U {a} is bad. Then, 
Zs = MogG(ii,cr^) and hence (Au{a}) = G{u,crt) > 
fsiA) from®. 

Suppose A U {a} is good. Now, let us define 

Va= Y1 ^J'^s^i^s and WA = Xa^Xjrj. 

Also note that if h~^{s) = Au {a} and the .sth bucket is good, 
then Za = Zs = Va + ToWa holds. Therefore we have that 



U {A U {a}) = E, [exp {uZ^ \ hT^^s) = ^ U {a}] 

= E,. [exp (uVa + u ■ r-aWA) \ h~\s) ^ AU {a}] 

= E,.[E,. [exp {uVa + u ■ r-aWA) \ h'\s) ^ AU {a}, {rj}jeA] 

1 h-'-is) ^ Au{a}] 
> E,.[exp (Er [uVa + u ■ raWA 1 /r^(s) = A\J {a}, {rj}jeA\) 

1 h-^{s) = A\j{a}]. 

(By Jensen's inequality, E[exp(3;)] > exp(E[a;])) 

= E, [exp {uVa) i h-\s) = A\J {a}] 



^^Er [e^p{uVA) I h-\s)^A] 
'^fs{A). 

Here, (a) follows since only is random in the inner expectation 
and 

Er [uVa + u ■ TaWA \ {s) = Avj {tt}, {rj}jeA] = uVa. 

And, (b) follows since a ^ A and Va does not depend on h{a) 
by the independence of the values of r and h. Finally, (c) follows 
since if ^ U {a} is good then so is A; therefore if h~^{a) — A, 
then we have that Zs = Zs = Va- The proof that fs is increasing 
is complete. 

The proof of $g_s$ being a decreasing function is similar, and can
be found in Appendix 9.3. □

Given our construction of the two functions, $f_s$ and $g_s$, we can
now proceed to apply the FKG inequality (Theorem 4) to show
that the MGF of the random variable $\sum_{i=1}^{k}\tilde{Z}_i$ is bounded by the
product of the MGFs of each $\tilde{Z}_i$ variable.



Lemma 10 It holds that

$$E\left[\exp\Big(u\sum_{i=1}^{k}\tilde{Z}_i\Big)\right] \le \prod_{i=1}^{k} E\left[\exp(u\tilde{Z}_i)\right],$$

where the expectation is taken over both $h$ and $r = \{r_j\}$.
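Proof. For all $1 \le s \le k$, we prove

$$E\left[\exp\Big(u\sum_{i=1}^{s}\tilde{Z}_i\Big)\right] \le \prod_{i=1}^{s} E\left[\exp(u\tilde{Z}_i)\right] \quad (7)$$

by induction on $s$. The base case of $s = 1$ is obvious.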

Now assume that the inductive hypothesis (7) holds for $s - 1$.
For all $A \subseteq [d]$ let us define

$$\mu_s(A) = \Pr[h^{-1}(s) = A] = \prod_{j \in A}\Pr[h(j) = s]\prod_{j \notin A}\Pr[h(j) \ne s].$$

It is easy to check that $\mu_s$ is a log-supermodular measure⁴ over the
subsets of $[d]$. Recalling the definition of the increasing function $f_s$
and the decreasing function $g_s$ from Lemma 9, it follows from the
FKG inequality (Theorem 4) that

$$E_{\mu_s}[f_s g_s] \le E_{\mu_s}[f_s]\,E_{\mu_s}[g_s].$$



⁴See [6, Section 6.2, page 87] for a precise definition and proof of
this fact.



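Furthermore, observe that for any random variable $X$ we have

$$E_{\mu_s}\left[E[X \mid h^{-1}(s) = A]\right] = \sum_{A \subseteq [d]} \Pr[h^{-1}(s) = A]\,E[X \mid h^{-1}(s) = A] = E[X],$$

and consequently,

$$E\left[\exp\Big(u\sum_{i=1}^{s-1}\tilde{Z}_i\Big)\exp(u\tilde{Z}_s)\right] \le E\left[\exp\Big(u\sum_{i=1}^{s-1}\tilde{Z}_i\Big)\right] E\left[\exp(u\tilde{Z}_s)\right].$$

Combining the latter with the induction hypothesis for $s - 1$ concludes
the proof. □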



Theorem 11 For the variables Zi we have 



(i) Pr 



(ii) Pr 



i 



< exp 



< -e 



< exp 



4(3 + (l + a)eln(|)) 



+ 45 +5, 



12 



+ 5. 



The proof of Theorem 11 involves a standard but tedious calculation
that is similar to one done by Matousek [23]. The proof can
be found in Appendix 9.4. Finally, we are ready to prove the main
result.

Proof, (of Theorem HJ. Recall that = Y^. HijXj, l\m& 
\\Hx\\l = yi^- Also recall that erf = Er[Yi^]. Thus, = 

= 1. Therefore, E ti 'i^." = E, + E. ^» = 1 + 
Eti Z^. Recall that fc = i| log(i), and a = Plugging 

these values in Theorem ll If i). we have Ei > with probabil- 
ity at most exp(-| ln(i) + AS) + S < 25, for (5 < Similarly, 
from Theorem II If ii), we have Ei Zi < —e with probability at 
most 2(5. Putting them together, with probability at least 1 — 45, 

IT^^Yi" - 1| = lE.^.I < and hence \\\Hx\\l - \\x\\l\ < 
4x\\l □ 

Proof (of Theorem 1). Theorem 1 easily follows from Theorem 2
by noting that if $y = Px$, then $\|y\|_2 = 1$ and $\|y\|_\infty \le \frac{1}{\sqrt{c}}$.
The running time is obtained as computing both $y = Px$ and $Hy$
requires $O(c \cdot nnz(x))$ time. □
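The two steps of this proof, replication by $P$ followed by hashing with $H'$, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, indices are 0-based, and for brevity the hash values and signs are pre-generated for all $cd$ indices (which itself costs $O(cd)$; a practical version would evaluate them on demand).

```python
import math
import random

def replicate(x_sparse, c):
    """y = Px: coordinate j of x becomes c copies of x_j / sqrt(c),
    stored at indices j*c, ..., j*c + c - 1 (0-based)."""
    s = 1.0 / math.sqrt(c)
    return {j * c + t: xj * s for j, xj in x_sparse.items() for t in range(c)}

def sparse_jl(x_sparse, d, k, c, seed=0):
    """Phi x = H'(Px) for x given as {index: value}.
    Assumption: a seeded PRNG models the random h': [cd] -> [k] and signs."""
    rng = random.Random(seed)
    h = [rng.randrange(k) for _ in range(c * d)]
    r = [rng.choice((-1, 1)) for _ in range(c * d)]
    y = replicate(x_sparse, c)      # nnz(y) = c * nnz(x)
    out = [0.0] * k
    for i, yi in y.items():         # O(c * nnz(x)) bucket updates
        out[h[i]] += r[i] * yi
    return out
```

Replication leaves the $\ell_2$ norm unchanged while shrinking the $\ell_\infty$ norm by a factor of $\sqrt{c}$, which is what brings an arbitrary unit vector within the density condition needed for hashing.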

4. PROOF OF THEOREM 3

Definition 12 (Randomized Hadamard matrix [2].) Construct the
$m \times m$ Hadamard matrix $F$ as $F_{ij} = m^{-1/2}(-1)^{\langle i-1, j-1 \rangle}$ and
the diagonal matrix $D$ by choosing each $D_{ii}$ independently from
$\{-1, 1\}$ with probability 1/2 for each value. The matrix $A = FD$
is defined to be an $m \times m$ randomized Hadamard matrix.

Using multiple small copies of the randomized Hadamard matrix,
we create the following preconditioner. Without loss of generality,
we assume that $d/b$ is an integer, for the given value of $b$. We note
that [3] also contains a similar construct; here we present a more
straightforward analysis using a different vector norm.

Lemma 13 Let $x \in \mathbb{R}^d$, $\|x\|_2 = 1$, and $1 > \delta > 0$, and $c > 1$.
Define $b = 6c\log(\frac{3c}{\delta})$ and assume $b \le d$. Construct $G \in \mathbb{R}^{d \times d}$ to
be a random block-diagonal matrix, where each of the $d/b$ diagonal
blocks of $G$ consists of an independent copy of a $b \times b$ randomized
Hadamard matrix. Then we have that

$$\Pr\left[\|Gx\|_\infty \ge \frac{1}{\sqrt{c}}\right] \le \delta.$$



Proof. If $A$ is a $b \times b$ randomized Hadamard matrix, then for any
$b$-dimensional vector $z$ with $\|z\|_2 = 1$ it holds that $\|Az\|_2 = 1$.
Using a Chernoff-type argument Ailon and Chazelle [2] showed that

$$\Pr[\|Az\|_\infty \ge s] \le 2b\exp\left(-\frac{s^2 b}{2}\right) \quad (8)$$

holds as well. Observe that the previous inequality trivially holds
for $\|z\|_2 \le 1$ as well. Let $y = Gx$, and $G_j$ denote the $j$th diagonal
block of $G$, and partition $x$ and $y$ into $d/b$ blocks $x_j$ and define $y_j =
G_j x_j$. Now for a block $j$, if $\|x_j\|_2 \le \frac{1}{\sqrt{c}}$ then $\|y_j\|_\infty \le \|y_j\|_2 \le \frac{1}{\sqrt{c}}$
holds as well, since $G_j$ is an isometry. Since $x$ is unit length,
there could be at most $c$ blocks $j$ such that $\|x_j\|_2 > \frac{1}{\sqrt{c}}$. Thus
setting $s$ to $\frac{1}{\sqrt{c}}$ in (8) and taking the union bound over these at most
$c$ blocks, we have that

$$\Pr\left[\|Gx\|_\infty \ge \frac{1}{\sqrt{c}}\right] \le 2bc\exp\left(-\frac{b}{2c}\right) = \frac{12c^2\log(\frac{3c}{\delta})\,\delta^3}{27c^3} \le \delta,$$

establishing the claim. □
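The block-diagonal preconditioner $G$ can be sketched with a standard fast Walsh-Hadamard transform applied block by block. This is a sketch under assumptions not made in the text: the block size `b` is taken to be a power of two (the value of $b$ from Lemma 13 would have to be rounded accordingly), and the function names and the seeded PRNG are hypothetical.

```python
import math
import random

def fwht(v):
    """In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n)
    so that it is an isometry. len(v) must be a power of two."""
    n = len(v)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for i in range(start, start + step):
                a, b = v[i], v[i + step]
                v[i], v[i + step] = a + b, a - b
        step *= 2
    scale = 1.0 / math.sqrt(n)
    for i in range(n):
        v[i] *= scale
    return v

def block_hadamard(x, b, seed=0):
    """y = Gx, where G is block diagonal with d/b independent b x b
    randomized Hadamard blocks A = F D."""
    d = len(x)
    assert d % b == 0 and b & (b - 1) == 0
    rng = random.Random(seed)
    y = [0.0] * d
    for start in range(0, d, b):
        # Signs are drawn for every block so that G does not depend on x.
        signs = [rng.choice((-1, 1)) for _ in range(b)]
        block = x[start:start + b]
        if any(block):  # all-zero blocks cost nothing beyond the scan
            y[start:start + b] = fwht([s * v for s, v in zip(signs, block)])
    return y
```

Only blocks containing a non-zero coordinate are transformed, which is the source of the sparsity-dependent running time; a single global Hadamard transform would touch all $d$ coordinates regardless of the input.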



Using the block-Hadamard preconditioner, we are ready to prove
Theorem 3. The $\epsilon$-approximation guarantee of the projection matrix
$\Phi$ follows trivially from the statements of Theorem 2 and of
Lemma 13.

In order to bound the running time, let $nnz_b(x)$ denote the number
of blocks that have non-zero coordinates in $x$. Then the running
time of the block-Hadamard based hashing is $O(nnz_b(x) \cdot b\log b +
nnz_b(x) \cdot b)$. Now,

nnzb{x) ■ blogb < min{nnz{x)b,d)logb 

= 0(min(nnz(a::)clog(^),d)log(^)) 

Now, clog(f) = 0(ilog(i)log2|log(^)). Hence the final 
running time is 



„ / . fnnz{x) 
O mm I log 



Note that if $\delta$ is not too small then the running time of Theorem 3
is comparable to the best existing methods for dense vectors [3],
yet it is much faster for sparse vectors. We remark that the
localized Hadamard preconditioner presented in this section could
also be combined with suitably sparse random matrices from [23]
by making $b$ larger, approximately equal to $k$. This variant would
reproduce the results of [3], but it fails to show any improvement
for sparse vectors over the naive construction as the running time 
would be n(^) per non-zero element. 

5. A LOWER BOUND 

A random matrix $\Phi$ is said to have the JL property if for every
vector $x$, $\Phi x$ satisfies (1) with probability $1 - \delta$ over the choice of
$\Phi$.

We show a lower bound on the sparsity for a class of constructions
of matrices with the JL property. The construction of the matrix
is modeled as a two stage process: first, the set of indices that
have non-zero entries is chosen, and then each column is chosen
independently at random. Note that we do not assume that the random
variables are independent within a column.

The lower bound argument of Matousek [23] shows that if the
set of non-zero indices in the first stage is chosen by independent 
coin tosses and if the random variables in the second stage are in- 
dependent (scaled) ±1 with equal probability, in expectation, then 



Cl{ Jji°° ) non-zero entries per column are needed to guarantee that 
the resulting matrix has the JL property. 

We show a lower bound on the sparsity for the case when the
non-zero indices are chosen arbitrarily. As mentioned earlier, if the
random variables in the second stage are $N(0, 1)$, then it is easy to
obtain a lower bound of $\Omega(\frac{1}{\epsilon^2})$ on the number of non-zero entries
per column: indeed, the lower bound follows since $\Omega(\frac{1}{\epsilon^2})$ such
random variables are needed so that their sum is $(1 \pm \epsilon)$, w.h.p.
Under mild technical conditions on the random variables, we can
prove the following lower bound stated in Theorem 14. It is easy
to see that the conditions of Theorem 14 are satisfied if the random
entries are independent (scaled) ±1 or when they are generated
by the replicated hashing construct of Theorem 1. Thus the upper
bound of Theorem 1 is tight with respect to $\epsilon$. The bound on the
number of non-zeros per column implies a bound on the worst case
update time over all vectors as well.

Theorem 14 Let l<c<k<dbe integers and M be an ar- 
bitrary, fixed or random, k x d 0-1 matrix with at most c non- 
zeroes per column. Let P be a k X d random matrix of the fol- 
'O ifMij^O 



lowing fi)rm Pi^ 



1. 



Here the vector valued 



Utj random variables are independent and for each j it holds that 
EEi Pi] = 1 'hat EE, Pi",] = OC-). 

Let Q < e < l/A. If P has JL property with probability at least 
1 _ £W, then 



1 \/logfe(d) 



Proof. For alH = 1, . . . ,d let d ^ {s £ \k]\Ms, / 0} 
denote the index set of non-zeros in the ith column of P. Further- 
more, let V = {ei, . . . , Ed}, where ei denotes the ith unit vector. 
For i ^ j we also define Xij — df] Cj and 5* = "^^tex- UuUtj- 
Then we have that 



By our assumptions it holds that E[F/] = E[C/t*]E[!7y] = O(^). 
Ifs^t then F.[Y^^Yt^] = E[Us^Uu]E[UsjU?j] holds by indepen- 
dence and hence from Holder's inequality we have thatE[UgiUti] < 
\/WUfi\EWE- Thus it holds that E[Y,^Yt^] < O(^). Lastly, re- 
call that that for all s, ti, t2, ta where s ^ {^1,^2, is} we have that 
ElYsYt.Yt^Yt^] =0. 

Now (9) Theorem 3.5] states that 



Pr 



|S|>i\/E[s5] 



(3/4)2 



E[S^12 



7_ 

16 



Therefore we have that 



Pr 



\S\ > 



2c 



(3/4)= 



-=t7(l). 



(10) 



On the other hand, it follows from the assumed JL property of P 
that with probability 1 — o(l), for all 1 < i < j < d, we have that 
i WPi^i + ej)||i - 2| < 2e and that 



||Pe,)||^-2| <2e. 



Therefore from combining equality ^ with inequality l llOt it fol- 
lows that -^^S! < 4£ must hold for all i 7^ j, or equivalently 
\Ci n Cj\ <z with z = IGe^c^ for all i / j. 

If $z < 1$, then the $C_i$ are pairwise disjoint and therefore $k \ge
dc \ge d$, a contradiction. Thus $z \ge 1$, and hence $c \ge \frac{1}{4\epsilon}$ immediately.
In what follows we strengthen the latter lower bound for a
large range of $d$ and $k$ as claimed.

If $c \ge \frac{1}{32\epsilon^2}$ then the lemma clearly holds as $\Omega(\frac{1}{\epsilon^2})$ is the largest
of the lower bounds claimed.

Now note that $c \ge 2$. Since if $c = 1$ were to hold, then
from $\epsilon < 1/4$ it follows that $z < 1$, which is a contradiction
as before. Therefore if $c < \frac{1}{32\epsilon^2}$ then all $C_i$'s are distinct as
$z + 1 = (c/2)(32\epsilon^2 c) + 1 < c/2 + c/2 = c$ holds.

Observe that any $z + 1$ element set is contained in at most one
$C_i$. Therefore the number of distinct $C_i$ is at most



\Pie^ + ej 



2 + 



\\l + 2S. 



(9) 



Using the fourth moment method li9j, we show that S has a large 
deviation with constant probability unless c is large enough. To- 
wards this goal for all t £ Xij setYt = UtiUtj andlet Xij — \Xij\. 

W.l.o.g. we can assume that each column of M contains exactly 
c non-zeroes and if Mu = 1 then E[Uti] = ^ and E[Uu] = 
0{-ij) hold as well; otherwise we replace P with a copy of P 
whose rows are randomly permuted. Furthermore we can also as- 
sume that E[UaiUti] ~ holds as multiplying each row of P with 
independent uniformly distributed ±1 random variables does not 
change ([9} or the theorem's conditions. Finally, w.l.o.g. we can 
assume that for all s,t\,t2,t-j, where s ^ {ii, t2, ^3} it holds that 
E[ysYtiyi2^t3] = as multiplying the rows of P with random 
±1 ensures the latter condition as well. 

Now observe that E[S"] = E[Et Y?] + E«^t E[YsYt] = ^ 
holds, since E[Y?] = EplU^] = E[Ul]E[Ui,] = ^ by the 
independence of columns. Moreover if s 7^ t then we have that 
E[YsYt\ = E[Us^UuUsJUtJ] = E[Us^Uu]E\UsJUtJ] = by in- 
dependence again. 

Similarly note that 

E[S^] = E[^ y,^] + ^ E[yM\ + J2 nYsYt,Y,Y,] 



f{k,c,z + l) = 



k 

z + 1 



z + 1 



a well known fact from block designs and set packing [18]. From
the Stirling formula, for all n > l,^2m(f)" < n! < l.lV2m{^y 

and it follows that for all 1< y < x it holds that ( -2i . /i^ < 
(y) - (^) '7h\f% 



Therefore we have that 



/(fc,c,z + l) < 



2k < k 



z+3 



(11) 



Now observe that d < f{k, c, z+1) as all d are distinct. Com- 
bining the latter with inequality i ll It . we have that logj. (d) — 3 < z. 
Recalling that 1 < z = l&t^tp' concludes the proof. □ 

Using a replication argument it is easy to see that if a matrix P 



Ikll 



< a for some a. 



only has the JL property for vectors x with 
then under the conditions of Theorem [14] we have that at least one 

column of P contains f2 



1 

7^' 



non-zeroes. 



If the fourth moment of the random entries per column scales 
with the number of non-zeros per column, the next theorem strength- 
ens the previous claim by bounding the average number of non- 
zeroes per column. This condition is satisfied, say, if the non-zero 
entries are independent scaled ±1 random variables. 



Theorem 15 Let < e < 1/4 and M be an arbitrary k x d 0- 
1 matrix with 2k^ < d. Let Cj denote the number of non-zeroes 
in the jth column of M. Let P be a k x d random matrix of the 



following form Pij — 



ifhUj 
ifM^j 







Here the vector valued 



Utj random variables are independent and for each j it holds that 
If P has JL property with probability at least 1 — ^^fi^ 



, then 



1=1 

Proof. Set 



1 yiogllrf) 



1 v/logfc(rf) 



For all j = 1, . . . , fc, assemble the columns of P with Ci = j into 
the k X rij matrix Pj . For all j if Uj > k then from assumed JL 
property of P it follows that Pj satisfies the conditions of Theo- 
rem[T4]with c = j and thus j > s. 

Therefore for all j < s we have that rij < k. The number of 
non-zeroes in P is X]i=i = X]j=i ^jJ' which we lower bound 
as follows 

k k / k s-1 \ 

"^oJ >YnjS= I ^ Uj - ^nA s> {d~ sk)s 



6. EMBEDDING INTO h 

We can show the following result for the case that the target
metric is $\ell_1$. The result and the corresponding proof are similar
to those of Ailon and Chazelle [2]. We construct the matrix $H$ as
follows: $H_{ij} = \delta_{i h(j)} r_j$, where the $r_j$ are now drawn as i.i.d.
$N(0, 1)$ random variables instead of being ±1. We then have the
following. Let $\beta_0 = \mathbf{E}[|z|]$ where $z \sim N(0, 1)$. By the 2-stability of
the normal distribution, $Y_i = \sum_j x_j r_j \delta_{i h(j)} \sim N(0, \sigma_i^2)$ where

$$\sigma_i^2 = \sum_j x_j^2 \delta_{i h(j)}.$$

Thus, $\mathbf{E}_r[|Y_i|] = \sigma_i \beta_0$.
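The identity $\mathbf{E}_r[|Y_i|] = \sigma_i \beta_0$ can be checked empirically. The following is a minimal Python sketch, not the construction's actual implementation: it assumes the hash function is given as a fixed list `h` of bucket indices, and the names `bucket_sigmas` and `mean_abs_bucket_sums` are hypothetical. It resamples the Gaussian $r_j$ repeatedly and averages $|Y_i|$ per bucket.

```python
import math
import random

BETA_0 = math.sqrt(2.0 / math.pi)  # E|z| for z ~ N(0, 1)


def bucket_sigmas(x, h, k):
    """sigma_i = sqrt(sum of x_j^2 over the coordinates j hashed to bucket i)."""
    sq = [0.0] * k
    for j, xj in enumerate(x):
        sq[h[j]] += xj * xj
    return [math.sqrt(s) for s in sq]


def mean_abs_bucket_sums(x, h, k, trials, rng):
    """Monte Carlo estimate of E_r|Y_i|, where Y_i = sum_j x_j r_j [h(j) = i]."""
    totals = [0.0] * k
    for _ in range(trials):
        y = [0.0] * k
        for j, xj in enumerate(x):
            # r_j ~ N(0, 1), drawn fresh in every trial; h stays fixed.
            y[h[j]] += xj * rng.gauss(0.0, 1.0)
        for i in range(k):
            totals[i] += abs(y[i])
    return [t / trials for t in totals]
```

With a seeded generator and a few tens of thousands of trials, each per-bucket average should land within a few percent of $\sigma_i \beta_0$.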



Theorem 16 There exists a constant $\epsilon_0$ such that for all $\epsilon < \epsilon_0$, if

c = k/e, and k = 0(^log(i)), Y = -^^Y^JYtl we have 
that $\Pr[|Y - 1| > \epsilon] < \delta$.

The proof is omitted in this version. 

7. DISCUSSIONS 

The most important open question is resolving the gap between
the upper and lower bounds with respect to the error probability.
It would be interesting to see whether our claims could be proven
more directly using stronger concentration inequalities.

Application of the current result to streaming settings would also
require proving the claims for a k-wise independent hash function
and ±1 variables. The chief hurdle in applying the techniques of
Clarkson and Woodruff [14] seems to be proving the FKG inequality
for the limited independence case. Note that Nisan's pseudorandom
number generator construction [25] can be used to derandomize
the hash function, but the naive way of doing this increases
the update time to k. We leave efficient derandomization as an open
question.
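For concreteness, a limited-independence hash function of the kind mentioned above can be sketched as follows. This is a minimal illustration that assumes the standard random-polynomial construction over a prime field; it is not taken from this paper, and the names `make_twise_hash`, `make_twise_sign` and the parameter `t` (the independence level, named `t` here to avoid a clash with the number of rows k) are hypothetical. The final reduction modulo the number of buckets makes the output only approximately uniform.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # field size; keys are assumed to be smaller than this


def make_twise_hash(t, num_buckets, rng):
    """Random polynomial of degree t - 1 over GF(p), reduced to num_buckets."""
    coeffs = [rng.randrange(MERSENNE_PRIME) for _ in range(t)]

    def h(j):
        acc = 0
        for c in coeffs:  # Horner evaluation, O(t) work per key
            acc = (acc * j + c) % MERSENNE_PRIME
        return acc % num_buckets

    return h


def make_twise_sign(t, rng):
    """Sign function in {-1, +1} built from an independent two-bucket hash."""
    g = make_twise_hash(t, 2, rng)
    return lambda j: 1 if g(j) == 1 else -1
```

Evaluating such a function costs O(t) arithmetic operations per key, so the independence level feeds directly into the per-update cost.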



It is worthwhile to note that the hash function represents a bipartite
expander. In a similar vein, Berinde et al. [10] use an unbalanced
expander graph based construction to create matrices with the
restricted isometry property for sparse signal recovery. Their
argument crucially uses two facts: that the error norm is $\ell_1$, and
that the input vector is sparse. It would be interesting to investigate
possible connections between these results.
< exp
k + ak/3



Since q > 3, 



Pr 



2 1 a 



< exp 



V2\ 



2ka/3 J 



< exp 
Choosing c = |^ log ( | ) , we get the above probability to be smaller
thanS/k. Since a — ^log^fc/^) , and k = i| log(l/(5), we have that 
choosing c — ^ log( j) log^(f) is sufficient. □ 



9.2 Bounding the MGF's 

We first compute the expectation of the MGF for different conditions
on the hashing function. We begin by proving Lemma 7.

9.2.1 Proof of Lemma 7
Proof. We have that $Z_i = \sum_{j \ne g;\ j, g \in h^{-1}(i)} x_j x_g r_j r_g$. Hence

$$Z_i = Y_i^2 - \sigma_i^2,$$

where $Y_i = \sum_{j : h(j) = i} r_j x_j$. Then,



9. APPENDIX 



9.1 Proof of Lemma 6



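$$\mathbf{E}_r[\exp(uY_i)] = \prod_{j : h(j) = i} \mathbf{E}_r[\exp(u r_j x_j)] = \prod_{j : h(j) = i} \Big(\tfrac{1}{2}\exp(u x_j) + \tfrac{1}{2}\exp(-u x_j)\Big) \le \prod_{j : h(j) = i} \exp\Big(\frac{u^2 x_j^2}{2}\Big) = \exp\Big(\frac{u^2 \sigma_i^2}{2}\Big).$$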
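The moment generating function bound for a sum of independent uniform ±1 signs can be verified directly on small inputs. The sketch below is illustrative only and assumes the $r_j$ are uniform on {-1, +1}; the function names are hypothetical. It compares a brute-force average over all sign patterns with the product of hyperbolic cosines, and then with the bound $\exp(u^2 \sum_j x_j^2 / 2)$.

```python
import itertools
import math


def mgf_by_enumeration(u, xs):
    """E[exp(u * sum_j r_j x_j)] averaged over all 2^n sign patterns."""
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=len(xs)):
        total += math.exp(u * sum(s * x for s, x in zip(signs, xs)))
    return total / 2 ** len(xs)


def mgf_product(u, xs):
    """Same expectation via independence: the product of cosh(u * x_j)."""
    return math.prod(math.cosh(u * x) for x in xs)


def subgaussian_bound(u, xs):
    """Upper bound exp(u^2 * sigma^2 / 2) with sigma^2 = sum_j x_j^2."""
    return math.exp(u * u * sum(x * x for x in xs) / 2.0)
```

The bound follows term by term from $\cosh(a) \le \exp(a^2/2)$, which is why the product form is the convenient one to work with.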



Proof. We show that af < aj, with probability 1 



5/k\ the 



proof will then follow from the union bound. 



Define the random variable Xi — S 



and since ||2;||c 



< we have X, < - 



lh(j)Xj 
3 



Fih[X'j\ = Eh ( 



, and 



1 

fc2 



Then, Eh[X,] = 
We also have 



x^\ 



< 



1 

fc2 



Also, ^ Xj = af — ^. Plugging this into the Bernstein's inequal- 
ity (lH Theorem 2.7], 



By the Markov inequality, we get the probability of $Y_i$ being larger
than $t$ as

$$\Pr[Y_i > t] \le \frac{\mathbf{E}_r[\exp(uY_i)]}{\exp(ut)} \le \frac{\exp(u^2\sigma_i^2/2)}{\exp(ut)} = \exp\Big(-\frac{t^2}{2\sigma_i^2}\Big)$$

by choosing $u = t/\sigma_i^2$. Note that we do not need to worry about
$\sigma_i$ being zero, as then $Y_i = 0$. Then, we bound $\mathbf{E}_r[\exp(uZ_i)]$
as follows. Denote $p(t) = \Pr[Z_i = t]$. We first compute the
expectation with respect to $r$. For any value of $\theta > 0$, we have

$$\mathbf{E}_r[\exp(uZ_i)] = \sum_{t \in (-\infty, \infty)} \exp(ut)\,p(t) \le \sum_{t \in (-\infty, \theta]} \exp(ut)\,p(t) + \sum_{t > \theta} \exp(ut)\,p(t).$$



The first term can be bounded as follows: 



By putting together the two parts, we have that 



te(~oo,e) te[o,9] V 3=2 ^' ) 



tG( — oo.l 



te(— oo.e 



te(-oo,9] j=2 



tG( — 00,00) tfz{ — OC,Oo) 



tg(-oo,e] j=2 



where the last inequality follows since in the range $[\theta, \infty)$, the
integral is positive. Then, the calculation can be simplified as follows:

J2 exp{ut)p{t)< J2 {p{-t) + utp{-t)) 
te(— 00,9) tg(-oo,oo) 



tS{-oo,e) j=2 



te(-oo,6l) j=2 

< 1 + 2-^ — ~Pv') smce t < 6 m this range 

$\le 1 + \frac{1}{\theta^2}\,(\exp(u\theta) - 1 - u\theta)\,\mathbf{E}_r[Z_i^2]$.
For the second term, we have 

Y exp{ut)p{t) 
< f:exp(^+l)exp(-^Z^
< ^ ^ exp
By assumption of the lemma, since it < jij-, we have that £
< — -r-^ . With this restriction, 

ZU(7- iua- 



Y exp{ut)p{t) < • ^ exp ^^^^ 
t>0 e=ue ^ * 

< 2yi. exp = 2^. exp (-^) < 4exp 



Er [exp(iiZO] < 1 + ;s^(exp(u6») - 1 - ue)E,[Zf] 



+ 4exp(-^ 



Choosing 6 — 4(tJ ln(A;/5), the proof is complete. □ 

9.3 Continued proof of Lemma 9

We finish the proof of Lemma 9 by showing that $g_s$ is decreasing.
To this end, we prove that for all $A \subseteq [d]$ and for all $a \in [d] \setminus A$,
$g_s(A \cup \{a\}) \le g_s(A)$. Recalling the definition of $g_s(A)$, we have



Qs (A) = E 



exp yj-YlZij \ h ^{s) = A 
expU^ZJ \Vj:h{j) 



I h-\s) = A 



(12) 

where the inner expectation is over the random variables {rj} only. 
Since h is completely independent we have that 



9s (A) 



E 



(hi,.-.,ha)elk]<i, 

Vj:hj=s-^jeA 



l)d-|A| 



and similarly 

gs iAu{a}) 



E 



E[exp(nE:r^^») I Vj : Mj) = ^3] 



(fc-1) 



d-|A|-l 



Therefore it is sufficient to show that for all
$(h_1, \ldots, h_{a-1}, h_{a+1}, \ldots, h_d) \in [k]^{d-1}$
with $\forall j \ne a : h_j = s \Leftrightarrow j \in A$ it holds that

E[exp(uE:ri \ -^j--h{j) = h,] 



E 

haG[fc]\{s} 



fc- 1 



> E 



exp u Zi ] \\/j^a: h{j) = hj , h{a) = s 



We shall prove the following stronger inequality: for all
$(h_1, \ldots, h_{a-1}, h_{a+1}, \ldots, h_d) \in [k]^{d-1}$ with $\forall j \ne a : h_j = s \Leftrightarrow j \in A$
and for all $h_a \in [k] \setminus \{s\}$ it holds that



E 



expU^Z. lVj:/i(i) 



> E 



exp i V j / a : h{j) = hj,h{a) = s 



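Now observe that only $r$ are random in the above expectations and
that $\tilde{Z}_i$ are conditionally independent given $h$. Therefore,

$$\mathbf{E}\Big[\exp\Big(u \sum_i \tilde{Z}_i\Big) \,\Big|\, h\Big] = \prod_i \mathbf{E}\big[\exp(u\tilde{Z}_i) \mid h\big].$$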



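From the non-negativity of the exponential function, it follows that
it is sufficient to show that for all $i = 1, \ldots, s-1$ and for all
$(h_1, \ldots, h_{a-1}, h_{a+1}, \ldots, h_d) \in [k]^{d-1}$ with $\forall j \ne a : h_j = s \Leftrightarrow j \in A$
and for all $h_a \in [k] \setminus \{s\}$ it holds that

$$E_I \ge E_{II}, \quad \text{where} \qquad (13)$$

$$E_I = \mathbf{E}[\exp(u\tilde{Z}_i) \mid \forall j : h(j) = h_j],$$

$$E_{II} = \mathbf{E}[\exp(u\tilde{Z}_i) \mid \forall j \ne a : h(j) = h_j,\ h(a) = s].$$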

We prove inequality (13) by a case analysis. If $h_a \ne i$, then
$E_I = E_{II}$ by definition. If $h_a = i$ and the $i$th bucket of $E_I$'s hash
function is bad, then El ~ G{u, af) > E_r, as shown earlier in 
Corollary [8] 

If $h_a = i$ and the $i$th bucket of $E_I$'s hash function is good then
the $i$th bucket of $E_{II}$'s hash function is also good. As before, define

$$V_a = \sum_{j \ne a,\ g \ne a,\ j \ne g} r_j r_g x_j x_g \,\delta_{h(j) i}\,\delta_{h(g) i},$$

and

$$W_a = \sum_{j \ne a} r_j x_j x_a \,\delta_{h(j) i}\,\delta_{h(a) i}.$$

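Again, note that if the $i$th bucket is good as assumed then $\tilde{Z}_i =
Z_i = V_a + r_a W_a$. Therefore we have that

$$E_I = \mathbf{E}\big[\,\mathbf{E}[\exp(uV_a + u r_a W_a) \mid \forall j : h(j) = h_j,\ \forall j \ne a : r_j] \;\big|\; \forall j : h(j) = h_j\,\big]. \qquad (14)$$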

Now observe that only $r_a$ is random in the inner expectation and

$$\mathbf{E}[uV_a + u r_a W_a \mid \forall j : h(j) = h_j,\ \forall j \ne a : r_j] = uV_a.$$

Thus from $\mathbf{E}[\exp(x)] \ge \exp(\mathbf{E}[x])$, it follows that

$$\mathbf{E}[\exp(uV_a + u r_a W_a) \mid \forall j : h(j) = h_j,\ \forall j \ne a : r_j] \ge \exp(uV_a)$$

as before. Plugging the latter into (14) we arrive at

$$E_I \ge \mathbf{E}[\exp(uV_a) \mid \forall j : h(j) = h_j] \qquad (15)$$
$$\phantom{E_I} = \mathbf{E}[\exp(uV_a) \mid \forall j \ne a : h(j) = h_j,\ h(a) = s].$$

Here the last equality follows from the fact that all $r$ and $h$ values
are independent and since $V_a$ does not depend on $a$. If $h^{-1}(s) =
A \cup \{a\}$ and the $i$th hash bucket is good as assumed then $\tilde{Z}_i = Z_i =
V_a$ and we observe that

$$\mathbf{E}[\exp(uV_a) \mid \forall j \ne a : h(j) = h_j,\ h(a) = s] = E_{II}. \qquad (16)$$

Combining (15) and (16), we conclude that $E_I \ge E_{II}$ for all cases
and hence $g_s$ is decreasing as claimed.
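The Jensen step in the good-bucket case has a closed form when $r_a$ is uniform on $\{-1, +1\}$. That distribution is an assumption stated here for the illustration; $V_a$ and $W_a$ are treated as constants because they are fixed by the conditioning on $h$ and on the remaining $r_j$:

$$\mathbf{E}_{r_a}[\exp(uV_a + u r_a W_a)] = \exp(uV_a)\cdot\tfrac{1}{2}\big(e^{uW_a} + e^{-uW_a}\big) = \exp(uV_a)\cosh(uW_a) \ge \exp(uV_a),$$

since $\cosh(y) \ge 1$ for every real $y$. Moving coordinate $a$ out of bucket $i$ removes the factor $\cosh(uW_a)$, which is why the conditional expectation cannot increase.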

9.4 Proof of Theorem M (i) 

Proof. Recall that the random variable Zi is defined as 

'Zi ifg„ 



Z^ = 



^logG{u,at) else. 



Note that G(u, at) > 1, and hence for u > 0, ^ logG'(M, at) > 
0. Also recall that Si is the indicator vector for bucket i being 
good, and S is the indicator for the hash function being good. By 
definition of Z^, since i log G{u, at) > 0, J2i Zi A S < J2i 
and hence we have that. 
Pr[S]



+ 5. 



(17) 



Thus, we prove a bound on $\mathbf{E}[\exp(u \sum_i \tilde{Z}_i)]$ and thus a bound on
$\Pr[\sum_i Z_i > \epsilon]$. Taking expectations over both $r$ and $h$, and using

Pr[S] < 5, 

E[exp(ttiO] < E[exp(uZO | 3] Pr[g] + G(ii, aj) Pr[g] 
< 1 + ^(exp(.*e) -l-uO) (^^(1 -S) + atS^ + 

where we combined the appropriate terms from the two parts of the 
sum. Recall that < 1. By choosing S < -p-, we have that 

E[exp{uZ,)] < 1 + l^{exp{ue) ~ 1 ^ u9)^ + ^ 



Taking the product over the $k$ terms, by using Lemma 10,

$$\mathbf{E}\Big[\prod_i \exp(u\tilde{Z}_i)\Big] \le \prod_i \mathbf{E}[\exp(u\tilde{Z}_i)]$$



< exp ( — (exp(u6i) - 1 - uO) + AS 



For completeness, we show how to determine the optimal u. 



Pr 



j:z.>e 

i 

2 



< ENp(.E,iO] < rrE[exp(uiO]exp(-ue) 
exp(ue) ±± ^ " 



< exp ( — (exp(M6') - l-u9) + S -ue] < exp {H(u) + 5) , 



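where we define $H(u) = \frac{2}{k\theta^2}\,(\exp(u\theta) - 1 - u\theta) - u\epsilon$. Setting
the derivative $H'(u) = 0$, we get $\frac{2}{k\theta}\,(\exp(u\theta) - 1) - \epsilon = 0$,
and hence

$$u = \frac{1}{\theta}\ln\Big(1 + \frac{k\epsilon\theta}{2}\Big)$$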



4cr| \-a(k/5) 



In 1 + 



Akea, 



4fcto^_ln(fc/£2 , 

Note that we need to restrict it < . We need — , ,, < 



1, which is true if setting 



4ktcT% lii(fc/4) 



2 24'°''^*^ 44(1+0,) ln{fc/4) ' 

which is permissive. Using this value of $u$, we have that (skipping
the simplifications)

$$H = -\frac{2}{k\theta^2}\Big[\Big(1 + \frac{k\epsilon\theta}{2}\Big)\ln\Big(1 + \frac{k\epsilon\theta}{2}\Big) - \frac{k\epsilon\theta}{2}\Big] \le -\frac{k\epsilon^2}{2\big(2 + \frac{k\epsilon\theta}{3}\big)},$$

which is the trick that Bernstein uses: $(1 + x)\ln(1 + x) - x \ge \frac{x^2}{2 + 2x/3}$.
Plugging in this value of $H$, we get that



Pr 



> e 



< exp 



4(3 + (l + Q)eln(fc/(5)) 



+ 4(5 



□ 
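The optimisation over $u$ in the proof above can be sanity-checked numerically. The sketch below is hedged: it takes $H(u) = \frac{2}{k\theta^2}(\exp(u\theta) - 1 - u\theta) - u\epsilon$ as the exponent being minimised, an assumption consistent with the closed form for $H$ quoted in the proof, and the function names and the sample values of k, ε and θ are hypothetical.

```python
import math


def H(u, k, eps, theta):
    """Exponent to be minimised over u (assumed form, see the lead-in)."""
    return 2.0 / (k * theta ** 2) * (math.exp(u * theta) - 1.0 - u * theta) - u * eps


def H_prime(u, k, eps, theta):
    """Derivative of H with respect to u."""
    return 2.0 / (k * theta) * (math.exp(u * theta) - 1.0) - eps


def u_star(k, eps, theta):
    """Stationary point of H: u = (1/theta) * ln(1 + k*eps*theta/2)."""
    return math.log(1.0 + k * eps * theta / 2.0) / theta


def bernstein_exponent(k, eps, theta):
    """Upper bound on H(u_star) from (1+x)ln(1+x) - x >= x^2 / (2 + 2x/3)."""
    return -k * eps ** 2 / (2.0 * (2.0 + k * eps * theta / 3.0))
```

Since H is convex in u, the stationary point is the global minimiser, and the Bernstein-style expression is a slightly weaker but simpler exponent.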



The proof of Theorem ll If ii) is similar to the above and is omitted 
in this version. 



